Conversation

@jglick (Member) commented Dec 12, 2017

Saw a support bundle that showed a thread inside

"jenkins.util.Timer [#3] / waiting for JNLP4-connect connection from …" …
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at hudson.remoting.Request.call(Request.java:165)
    - locked <0x…> (a hudson.remoting.UserRequest)
    at hudson.remoting.Channel.call(Channel.java:903)
    at hudson.FilePath.act(FilePath.java:987)
    at hudson.FilePath.act(FilePath.java:976)
    at hudson.FilePath.deleteRecursive(FilePath.java:1178)
    at org.jenkinsci.plugins.durabletask.FileMonitoringTask$FileMonitoringController.cleanup(FileMonitoringTask.java:187)
    at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:326)
    at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:279)
    at …

From that it was unclear whether this one call to FilePath.deleteRecursive was in fact taking more than the 10s allotted, or whether there were lots of repeated calls to run/check, or what. This patch:

  • Ensures that even if some code swallows InterruptedException, a new signal will be sent.
  • Makes sure something appears in the system log when a thread is blowing past its timeout.

I tried to also set the thread name to indicate how stalled it is, to allow this information to appear in thread dumps, but could not get it to work—whether due to a JVM bug or some oversight on my part, I am not sure.
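
To make the two bullets above concrete, here is a minimal sketch of the idea, assuming a plain JDK scheduled pool in place of jenkins.util.Timer so the example stays self-contained. It is an illustration, not the actual patch: the class and every name in it are made up for the example (the real utility being modified is the existing Timeout class mentioned later in this thread).

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.logging.Level;
    import java.util.logging.Logger;

    /** Illustrative sketch only: a timeout guard that logs overruns and re-sends interrupts. */
    class RepeatingTimeout implements AutoCloseable {

        private static final Logger LOGGER = Logger.getLogger(RepeatingTimeout.class.getName());
        // Stand-in for jenkins.util.Timer.get(), which is likewise a fixed-size scheduled pool.
        static final ScheduledExecutorService TIMER = Executors.newScheduledThreadPool(10);

        private final Thread thread;
        private final long deadlineNanos;
        private volatile boolean done;

        RepeatingTimeout(Thread thread, long timeout, TimeUnit unit) {
            this.thread = thread;
            this.deadlineNanos = System.nanoTime() + unit.toNanos(timeout);
            TIMER.schedule(this::ping, unit.toNanos(timeout), TimeUnit.NANOSECONDS);
        }

        private void ping() {
            if (done) {
                return; // the guarded body finished in time; nothing to do
            }
            // Make the stall visible in the system log, including how far past the deadline we are.
            long overrunMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - deadlineNanos);
            LOGGER.log(Level.WARNING, "Interrupting {0}, now {1}ms past its timeout",
                    new Object[] {thread.getName(), overrunMs});
            thread.interrupt();
            // Even if something in the body swallowed that interrupt, send another signal later;
            // each ping runs essentially instantaneously and nothing waits on it.
            TIMER.schedule(this::ping, 5, TimeUnit.SECONDS);
        }

        @Override public void close() {
            done = true;
        }
    }

Usage would look like try (RepeatingTimeout t = new RepeatingTimeout(Thread.currentThread(), 10, TimeUnit.SECONDS)) { /* remote call */ }. A production version would also have to clear any interrupt that lands just after close(), so it does not leak into later code.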

@reviewbybees

@jglick requested review from abayer and svanoort December 12, 2017 23:27
@oleg-nenashev (Member) commented:

From that it was unclear whether this one call to FilePath.deleteRecursive was in fact taking more than the 10s allotted, or whether there were lots of repeated calls to run/check, or what

The object hashes were the same across 9 thread dumps taken at 5-second intervals, so most likely this is the first case.

abayer previously approved these changes Dec 13, 2017

@abayer (Member) left a comment:

Seems reasonable to me, but I'd like @svanoort's thoughts.

@jglick (Member, Author) commented Dec 13, 2017

Object hashes were same

Which object hashes? UserRequest? If so, that suggests that Channel.call is ignoring thread interruptions. I do not see that happening, but maybe I am missing something.

Anyway, @reviewbybees done for now.

@jglick (Member, Author) commented Dec 14, 2017

By the way, before implementing this (i.e., with the original Timeout behavior) I did some crude testing: I ran ExecutorStepTest.buildShellScriptAcrossRestart, which launches a JNLP agent and runs a sh step on it, modified to sleep longer so I had time to intervene; while the process was running, I STOPped the agent JVM so it would cease to respond; and after getting the expected

Cannot contact dumbo: java.lang.InterruptedException

message, I waited. The timeout was enforced to the extent that individual calls to check did in fact run for just 10s each. The build continued to wait for a response (checking repeatedly), as currently designed: Pipeline does not treat a hung or even crashed agent JVM as fatal, since the same JVM or a replacement could come back and resume status updates about the external process. (I am not sure whether @svanoort has plans to change this behavior or offer alternative durability strategies.) I then interrupted the build through the UI, and it correctly ABORTED after printing

Could not connect to dumbo to send interrupt signal to process

with no further human intervention.

I could probably develop this into an automated test in workflow-durable-task-step with a little work, though I am more interested in finding scenarios where recovery does not happen as expected. In other words, the repeated-interruption portion of this patch is a speculative fix, not something I was able to demonstrate is necessary; the enhanced logging could in the future be used to rule out possible explanations for hangs.
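
For readers following along, the polling model being described has roughly this shape, reusing the RepeatingTimeout sketch from the description above. PollingExecution and its constants are hypothetical simplifications of DurableTaskStep.Execution, not the real code:

    import java.util.concurrent.TimeUnit;

    /** Hypothetical simplification of DurableTaskStep.Execution, for illustration only. */
    class PollingExecution implements Runnable {

        private static final long RECHECK_PERIOD_SECONDS = 15; // illustrative, not the real period

        @Override public void run() {
            try (RepeatingTimeout t = new RepeatingTimeout(Thread.currentThread(), 10, TimeUnit.SECONDS)) {
                check(); // one remote round trip; hangs if the agent JVM is stopped
            } catch (Exception x) {
                // Not fatal: print something like "Cannot contact …" to the build log and keep
                // polling, since the same or a replacement agent JVM may reconnect and resume
                // status updates about the external process.
            }
            RepeatingTimeout.TIMER.schedule(this, RECHECK_PERIOD_SECONDS, TimeUnit.SECONDS);
        }

        private void check() throws Exception {
            // e.g. FilePath.act(...) over the remoting channel, as in the stack trace above
        }
    }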

svanoort previously approved these changes Dec 14, 2017

@svanoort (Member) left a comment:

I don't have a strong feeling one way or the other -- it's hard to tell what the impact will be until we see it in the wild. I'm slightly nervous about using Timer threads more, given issues we've seen with the bounded threadpool, but also trying harder to kill threads seems smart too?

The interactions with the pipeline threading model are hard to predict with precision.

I do think we want to think hard about simplifying the pipeline threading model in the mid-term future because the chain of causality has become rather esoteric.

@jglick I have no plans to modify that in durability as of yet.

@abayer (Member) commented Dec 14, 2017

Fwiw, I think this is a problem that shows up on ci.jenkins.io pretty often, so a fix is nice to see.

@jglick (Member, Author) commented Dec 14, 2017

it's hard to tell what the impact will be until we see it in the wild

Correct.

I'm slightly nervous about using Timer threads more

Not sure what you mean. Timeout was already using Timer threads to deliver interrupts. This patch could cause more tasks to be scheduled (if something else is indeed very wrong), but each one should run essentially instantaneously, and nothing waits for it.

think hard about simplifying the pipeline threading model

If you have any suggestions let me know, but note that jenkinsci/workflow-durable-task-step-plugin#21 would eliminate most of the checks discussed here (I left in a 5m ping to allow recovery from certain corner cases).

this is a problem that shows up on ci.jenkins.io pretty often

If you have the ability to track down particular instances and investigate, that would be important work. I for one have no permissions on the server.

Commit: …as a support bundle showed that all Timer threads were occupied.

@jglick dismissed stale reviews from abayer and svanoort December 14, 2017 21:12 (reason: stale)

@jglick (Member, Author) commented Dec 14, 2017

taking more than the 10s allotted

Turns out it was taking longer than 10s, but no interruption was being delivered, because the other nine Timer threads were also occupied! Oops.
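
The failure mode is easy to reproduce with a plain JDK scheduled pool of the same size as jenkins.util.Timer (ten threads). This self-contained demo is an analogy, not Jenkins code:

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    /** All ten pool threads are stuck, so a scheduled interrupt task never gets to run. */
    public class TimerStarvationDemo {
        public static void main(String[] args) throws InterruptedException {
            ScheduledExecutorService timer = Executors.newScheduledThreadPool(10);
            CountDownLatch blocked = new CountDownLatch(10);
            for (int i = 0; i < 10; i++) {
                timer.execute(() -> {
                    blocked.countDown();
                    try {
                        Thread.sleep(Long.MAX_VALUE); // stands in for a stuck remote call
                    } catch (InterruptedException x) {
                        // exit when finally interrupted by shutdownNow
                    }
                });
            }
            blocked.await();
            // The would-be timeout enforcement; it can never start, because every
            // pool thread is occupied by the stuck tasks above.
            timer.schedule(() -> System.out.println("interrupt delivered"), 1, TimeUnit.SECONDS);
            Thread.sleep(5_000);
            System.out.println("5s later: nothing printed; the timeout was never enforced");
            timer.shutdownNow();
        }
    }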

@oleg-nenashev (Member) left a comment:

🐝 nice catch

@jglick (Member, Author) commented Dec 14, 2017

@reviewbybees done FWIW

@jglick (Member, Author) commented Dec 14, 2017

BTW the agent-side hang in this case was believed to be JENKINS-39179.

@svanoort (Member) left a comment:

Ooh, I like it with the new modification. Yeah, this is some good stuff.

abayer merged commit 5ea4e13 into jenkinsci:master Dec 18, 2017
jglick deleted the Timeout branch December 18, 2017 16:06